Create a set of functions that can be used together to segment satellite images into similar regions using k-means clustering, and then create and apply a color mask to areas of water. Specifically:
First let's import a few essentials and look at our satellite image.
In [12]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from PIL import Image, ImageFilter
In [2]:
img0 = Image.open('Segmentation/sat4.jpg')
img0
Out[2]:
There is a lot of texture in the areas around the river, and just a bit in the cultivated land areas. Two easy options worth exploring for reducing the texture are gaussian blur (a well known image filter) and de-noising. While there are many methods of smoothing images, some are only suited to greyscale images. These two for sure work with multi-channel (color) images, so we will try them out.
In [3]:
from skimage.filters import gaussian
# apply the filter (returns an array)
# sigma is parameter for the blur (higher value => more blur)
img1 = gaussian(np.array(img0), sigma=1.5, multichannel=True)
# multiply result by 255 for rgb (current values in range [0,1])
img1 = 255*img1
# convert back to image and show it
img1 = Image.fromarray(img1.astype('uint8')) # uint8 for image format
img1
Out[3]:
In [4]:
from skimage.restoration import denoise_nl_means as dnm
# apply
# h is a parameter that determines how smooth the image is (higher h => more smooth)
img2 = dnm(np.array(img0), h=25, multichannel=True)
img2 = Image.fromarray(img2.astype('uint8'))
img2
Out[4]:
Obviously gaussian blur is insufficient for our needs, because it doesn't preserve the edges within the photo and too much blur would be required to get smoothed colors. The best option is to de-noise the image, leaving edges relatively intact. The only downside is how long the de-noising takes.
This is a pretty simple function that requires only three lines, not including the return.
In [5]:
def pre_process(img, h, multichannel):
from skimage.restoration import denoise_nl_means as dnm
img2 = dnm(np.array(img), h=h, multichannel=multichannel)
# convert array to image
img2 = Image.fromarray(img2.astype('uint8'))
return img2
Test it
In [6]:
pre_process(img0, 25, True)
Out[6]:
Segmenting will require some colors to apply. We can either manually create an array of RGB values, or just use some that are already pre-defined:
For ease of maintenance and comprehension let's split this process into two functions:
image_labels()
: perform the cluster analysis and return a list of labels along with the (mostly) flattened imagesegment_image()
: use the results from image_labels()
to segment the image and return the result
In [7]:
def image_labels(img, n_clusters):
from sklearn.cluster import KMeans as KM
# set seed for consistency in cluster labels
# for demonstraion purposes only
np.random.seed(1234)
# flatten image for clustering (one vector per channel (r,g,b))
flat_img = np.array(img).reshape(-1,3)
# create and fit k-means on img and get labels
km = KM(n_clusters).fit(flat_img)
labels = km.labels_
# scikit-image slic clustering
# from skimage.segmentation import slic
# slic_labels = slic(img1, n_segments=5, sigma=1.5, enforce_connectivity=False)
return flat_img, labels
In [8]:
def segment_image(img, flat_img, labels, cmap):
import seaborn as sns
# create array of colors to use in segmentation
colors = sns.color_palette(cmap, len(np.unique(labels)))
colors = [255*np.array(c) for c in colors]
# assign colors based on labels
seg_img = flat_img.copy()
for i,c in enumerate(labels):
seg_img[i] = colors[c]
# reshape img_seg array and convert to image
seg_img = Image.fromarray(seg_img.reshape(np.array(img).shape))
# we will need the colors for masking
return seg_img, colors
Let's test them all together
In [9]:
pre = pre_process(img0, 25, True)
flat, labels = image_labels(pre, 4)
result, _ = segment_image(pre, flat, labels, 'spectral')
result
Out[9]:
For the 6 possible parameters we can have a grid parameter that takes in an array of combinations to iterate over. We will use itertools to create that grid before running the function.
In [10]:
def image_variations(img, grid):
# init list to store image variations
image_vars = []
# vectors of h- and k-vals
# k := number of clusters
h = grid[:, 0]
k = grid[:, 1]
# only need to handle 6 variations
for i in range(6):
pre_img = pre_process(img, h[i], True)
flat_img, labels = image_labels(pre_img, k[i])
seg_img, _ = segment_image(pre_img, flat_img, labels, 'spectral')
# store image
image_vars.append(np.array(seg_img))
# create single array of images
stack1 = np.hstack([image_vars[0], image_vars[1]])
stack2 = np.hstack([image_vars[2], image_vars[3]])
stack3 = np.hstack([image_vars[4], image_vars[5]])
image_arr = np.vstack([stack1, stack2, stack3])
# convert to image
image_arr = Image.fromarray(image_arr)
return image_arr
In [11]:
from itertools import product
# use itertools product method to create all possible combination from two lists
# first list is of the pre-processing param values we want to test
# second list is of the numer of clusters we want to test
grid = np.array(list(product([25, 30],[3, 4, 5])))
grid.reshape(3,4) # same layout as image array
Out[11]:
In [12]:
img_arr = image_variations(img0, grid)
img_arr
Out[12]:
In [13]:
# original for comparison
img0
Out[13]:
First we need to reproduce this version and look at the color values to determine how to segment the water (this is why our function returns colors)
In [14]:
pre_img = pre_process(img0, 25, True)
flat_img, labels = image_labels(pre_img, 5)
seg_img, colors = segment_image(pre_img, flat_img, labels, 'spectral')
seg_img
Out[14]:
In [15]:
colors
Out[15]:
The river is green, thus the third array (in RGB order) — which is index=2 — is the one being applied. This means class label 2 corresponds to water.
In [16]:
# modify the color assigned to lower left pixel (guaranteed to be water)
# the river is the green color, which corresponds to the third row in the array [r, g, b] order
water_label = 2
colors[water_label] = [0, 255, 255]
# assign colors based on labels
img1_seg = np.array(img0).reshape(-1,3)
for i,c in enumerate(labels):
# apply only if water
if c == water_label:
img1_seg[i] = colors[c]
img1_seg = Image.fromarray(img1_seg.reshape(np.array(img1).shape))
img1_seg
Out[16]:
In [17]:
img0
Out[17]:
The dataset provided can be summarized as follows:
In 2013, students of the Statistics class at FSEV UK were asked to invite their friends to participate in this survey.
Your task is to use agglomerative (hierarchical) cluster analysis to analyze this data. Specifically:
hierarchy.fcluster
method to get cluster labels for 16 clusters from the data
In [4]:
# read data
import pandas as pd
from pandas import DataFrame as DF, Series
data = pd.read_csv('../Targeted/responses.csv')
In [5]:
data.dtypes.value_counts()
Out[5]:
In [6]:
data_num = data.select_dtypes(exclude=[object]).copy()
data_num['gender'] = data.Gender.map({'female': 1, 'male': 0})
data_num.shape
Out[6]:
In [7]:
data_num.dropna(how='any', inplace=True)
data_num.shape
Out[7]:
In [8]:
from scipy.cluster.hierarchy import dendrogram, linkage
# this is what we need for the dendrogram
z = linkage(data_num.drop('gender', axis=1), 'ward')
a given row of z
gives us [idx_a, idx_b, d, count]
, where
idx_a
and idx_b
: the indexes for the two records in our data that were merged in that stepd
: the distance between these data pointscount
: the number of samples in the cluster just created
In [9]:
z.shape
Out[9]:
In [10]:
z[:, -1].max()
Out[10]:
In [13]:
plt.figure(figsize=(20,70))
plt.title('Dendrogram for Survey Data')
plt.xlabel('Distance (ward)')
plt.ylabel('Data Index')
dendrogram(z, orientation='left', leaf_rotation=0)
# plot cutoff line at distance = 49
plt.axvline(49, color='k');
Counting the number of lines that intersect the cutoff, there are 16 clusters.
In [14]:
plt.figure(figsize=(20,10))
plt.title('Dendrogram for Survey Data')
plt.xlabel('Data Index')
plt.ylabel('Distance (ward)')
# p is the number of clusters
# lastp is telling the algorithm to truncate using the number of clusters we set
dendrogram(z, orientation='top', leaf_rotation=90, p=16, truncate_mode='lastp', show_contracted=True);
hierarchy.fcluster
method to get cluster labels for 16 clusters from the data
In [15]:
from scipy.cluster.hierarchy import fcluster
# we get cluster labels by using the linkage array (z) that we created earlier
# t=49 is setting the threshold that we determined before (this will give desired number of clusters)
clusters = fcluster(Z=z, t=49, criterion='distance')
clusters.max()
Out[15]:
There are class labels that are aligned with the rows in our data. Let's add a cluster column
In [16]:
data_num['cluster'] = clusters
In [17]:
# user groupby to get gender-specific counts for each class
g = data_num.groupby(['gender', 'cluster']).size().unstack()
# compute proportions of classes by gender
gp = g.div(g.sum(axis=1), axis=0)
gp.plot.bar(figsize=(14,10));
Interestingly, there appear to be very disticnt differences between men and women. This could be partially due to height/weigth differences.
In [18]:
clusters2 = fcluster(Z=z, t=250, criterion='distance')
In [19]:
data_num['cluster2'] = clusters2
In [20]:
data_num.groupby(['cluster2', 'gender']).size()
Out[20]:
They two top-level classes most definitely are gender-speific. No need to do a significance test here.
Given a set of images, use k-means to perform color quantization to reduce the number of distinct colors in each image. Specifically
In [33]:
def quantize(img, n_colors):
from sklearn.cluster import KMeans as KM
# set seed for consistency in cluster labels
# for demonstraion purposes only
np.random.seed(1234)
shape = np.array(img).shape
if (len(shape) == 3):
# flatten image for clustering (one vector per channel (r,g,b))
flat_img = np.array(img).reshape(-1,3)
else:
# image is greyscale
flat_img = np.array(img).reshape(-1,1)
# create and fit k-means on img
km = KM(n_colors).fit(flat_img)
# get labels and cluster centers
labels = km.labels_
centers = km.cluster_centers_
# color quantize using cluster centers
quant_img = flat_img.copy()
for i,c in enumerate(labels):
quant_img[i] = centers[c]
quant_img = Image.fromarray(quant_img.reshape(shape))
return quant_img
In [34]:
bw = Image.open('Quantization/bw.jpg')
In [35]:
bw_q = quantize(bw, 16)
bw_q
Out[35]:
In [49]:
def batch_reduce(files, n_colors):
import os
reduced = []
images = []
# read images
for f in files:
img = Image.open(f)
images.append(img)
# quantize images
for img in images:
im_q = quantize(img, n_colors)
reduced.append(im_q)
print('Images quantized')
# save reduced as png
root = 'Quantization/images/reduced/'
if not os.path.exists(root):
os.makedirs(root)
for i, im in enumerate(reduced):
im.save(root + 'im_q_{}.png'.format(i))
# save originals as png
root = 'Quantization/images/original/'
if not os.path.exists(root):
os.makedirs(root)
for i, im in enumerate(images):
im.save(root + 'im_{}.png'.format(i))
print('Images saved')
In [43]:
import os
os.listdir('Quantization/images')
Out[43]:
In [48]:
import os
images = os.listdir('Quantization/images')
files = ['Quantization/images/'+f for f in images]
batch_reduce(files, 16)
Check these new files and notice that the reduced images are each roughly half the file size of the originals. The reduced and original files will be in the same location as this notebook.
Create a simple custom pipeline that does the following with the survey data:
In [21]:
data = data_num.drop(['cluster', 'cluster2'], axis=1)
# train/test split
Xtr = data.sample(frac=0.67)
Xts = data[~data.index.isin(Xtr.index)].values
Xtr = Xtr.values
# PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=None).fit(Xtr)
components = pca.components_
ex_var = pca.explained_variance_ratio_
# select components
sums = ex_var.cumsum()
n_comps = 0
for i, s in enumerate(sums):
if s > 0.9:
n_comps = i
break
# transform train and test sets
pca = PCA(n_components=n_comps).fit(Xtr)
Xtr = pca.transform(Xtr)
Xts = pca.transform(Xts)
# k-means
from sklearn.cluster import KMeans as KM
from sklearn.ensemble import RandomForestClassifier as RF
# just for consistency
np.random.seed(123)
km = KM(n_clusters=16).fit(Xtr)
labels = km.predict(Xtr)
# train RF classifier
rf = RF(n_estimators=500).fit(Xtr, labels)
# make predictions with RF and Cluster and compare
y_rf = rf.predict(Xts)
y_km = km.predict(Xts)
from sklearn.metrics import classification_report as crp
report = crp(y_km, y_rf)
print(report)
There will be a message for UndefinedMetricWarning above--this is fine. So the two don't entirely agree, but they aren't bad for so little data.
In [22]:
n_comps
Out[22]: